Abstractive spoken document summarization using hierarchical model with multi-stage attention diversity optimization
Abstractive summarization is a standard task for written documents, such as news articles. Applying summarization schemes to spoken documents is more challenging, especially in situations involving human interactions, such as meetings. Here, utterances tend not to form complete sentences and sometimes contain little information. Moreover, speech disfluencies will be present, as well as recognition errors for automated systems. For current attention-based sequence-to-sequence summarization systems, these additional challenges can yield a poor attention distribution over the spoken document words and utterances, impacting performance. In this work, we propose a multi-stage method based on a hierarchical encoder-decoder model to explicitly model utterance-level attention distribution at training time, and enforce diversity at inference time using a unigram diversity term. Furthermore, multitask learning tasks including dialogue act classification and extractive summarization are incorporated. The performance of the system is evaluated on the AMI meeting corpus. The inclusion of both training and inference diversity terms improves performance, outperforming current state-of-the-art systems in terms of ROUGE scores. Additionally, the impact of ASR errors, as well as performance on the multitask learning tasks, is evaluated.
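The inference-time unigram diversity term can be illustrated with a minimal sketch, purely for intuition: a hypothesis score is penalized in proportion to how many unigrams it repeats. The penalty weight and exact form here are hypothetical, not the paper's formulation.

```python
from collections import Counter

def diversity_adjusted_score(log_prob, hypothesis, penalty=0.5):
    """Penalize a hypothesis score by the number of repeated unigrams.

    `penalty` is an illustrative weight; the paper's exact diversity
    term may differ. A hypothesis that repeats itself scores lower,
    discouraging degenerate repetitive summaries during beam search.
    """
    counts = Counter(hypothesis)
    repeats = sum(c - 1 for c in counts.values() if c > 1)
    return log_prob - penalty * repeats
```

For example, a hypothesis repeating "the cat" twice incurs a penalty of 2 repeated unigrams, so its log-probability of -2.0 drops to -3.0 under this sketch.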
Impact of ASR performance on spoken grammatical error detection
Computer assisted language learning (CALL) systems aid learners to monitor their progress by providing scoring and feedback on language assessment tasks. Free speaking tests allow assessment of what a learner has said, as well as how they said it. For these tasks, Automatic Speech Recognition (ASR) is required to generate transcriptions of a candidate's responses; the quality of these transcriptions is crucial for providing reliable feedback in downstream processes. This paper considers the impact of ASR performance on Grammatical Error Detection (GED) for free speaking tasks, as an example of providing feedback on a learner's use of English. The performance of an advanced deep-learning based GED system, initially trained on written corpora, is used to evaluate the influence of ASR errors. One consequence of these errors is that grammatical errors can result from incorrect transcriptions as well as learner errors, which may yield confusing feedback. To mitigate the effect of these errors, and reduce erroneous feedback, ASR confidence scores are incorporated into the GED system. By additionally adapting the written text GED system to the speech domain, using ASR transcriptions, significant gains in performance can be achieved. Analysis of the GED performance for different grammatical error types and across grades is also presented.
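One simple way to use ASR confidence, sketched here purely for illustration (the paper's integration into the GED model itself may be more involved), is to suppress error flags on tokens the recognizer was unsure about, since an apparent grammatical error there may really be a misrecognition.

```python
def filter_ged_flags(tokens, ged_flags, asr_confidences, threshold=0.6):
    """Suppress grammatical-error flags on low-confidence ASR tokens.

    A flagged 'error' on a token the recognizer was unsure about may be
    a transcription mistake rather than a learner error, so it is not
    reported. The threshold value is illustrative, not from the paper.
    """
    return [flag and conf >= threshold
            for flag, conf in zip(ged_flags, asr_confidences)]
```

With a threshold of 0.6, a flag on a token recognized with confidence 0.4 is dropped, while one at 0.8 is kept.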
Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization. Training and inference using large transformer models can be computationally expensive. Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder. Modified encoder architectures such as LED or LoBART use local attention patterns to address this problem for summarization. In contrast, this work focuses on the transformer's encoder-decoder attention mechanism. The cost of this attention becomes more significant in inference or training approaches that require model-generated histories. First, we examine the complexity of the encoder-decoder attention. We demonstrate empirically that there is a sparse sentence structure in document summarization that can be exploited by constraining the attention mechanism to a subset of input sentences, whilst maintaining system performance. Second, we propose a modified architecture that selects the subset of sentences to constrain the encoder-decoder attention. Experiments are carried out on abstractive summarization tasks, including CNN/DailyMail, XSum, Spotify Podcast, and arXiv.
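Constraining encoder-decoder attention to a subset of input sentences can be sketched as a cross-attention mask: encoder positions belonging to unselected sentences are excluded before the softmax, so attention weight is distributed only over the kept sentences. This is an illustrative toy, not the proposed architecture.

```python
import math

def sentence_subset_mask(sentence_ids, selected):
    """True where an encoder token belongs to a selected sentence."""
    return [sid in selected for sid in sentence_ids]

def masked_softmax(scores, mask):
    """Softmax over unmasked positions only; masked positions get 0.

    Subtracting the max of the kept scores keeps the exponentials
    numerically stable.
    """
    kept = [s for s, m in zip(scores, mask) if m]
    mx = max(kept)
    exps = [math.exp(s - mx) if m else 0.0 for s, m in zip(scores, mask)]
    z = sum(exps)
    return [e / z for e in exps]
```

With five encoder tokens drawn from sentences [0, 0, 1, 1, 2] and only sentences {0, 2} selected, the two tokens of sentence 1 receive zero attention and the remaining weight sums to one over the other three.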
Long-span summarization via local attention and content selection
Transformer-based models have achieved state-of-the-art results in a wide range of natural language processing (NLP) tasks including document summarization. Typically these systems are trained by fine-tuning a large pre-trained model to the target task. One issue with these transformer-based models is that they do not scale well in terms of memory and compute requirements as the input length grows. Thus, for long document summarization, it can be challenging to train or fine-tune these models. In this work, we exploit large pre-trained transformer-based models and address long-span dependencies in abstractive summarization using two methods: local self-attention; and explicit content selection. These approaches are compared on a range of network configurations. Experiments are carried out on standard long-span summarization tasks, including Spotify Podcast, arXiv, and PubMed datasets. We demonstrate that by combining these methods, we can achieve state-of-the-art results on all three tasks in the ROUGE scores. Moreover, without a large-scale GPU card, our approach can achieve comparable or better results than existing approaches.
1. ALTA Institute, Cambridge Assessment English, University of Cambridge
2. Cambridge International & St John's College Scholarship
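Local self-attention can be illustrated by the window of positions each token is allowed to attend to. The sketch below (illustrative only, not a specific model's implementation) shows why memory grows as O(n·w) with window size w instead of O(n²) for full self-attention.

```python
def local_attention_window(seq_len, window):
    """For each position i, the half-open span of positions it may
    attend to under local self-attention: [i - window, i + window],
    clipped to the sequence boundaries. Each position touches at most
    2 * window + 1 others, so cost is linear in sequence length."""
    return [(max(0, i - window), min(seq_len, i + window + 1))
            for i in range(seq_len)]
```

For a sequence of length 5 with window 1, position 2 attends to positions 1-3, while boundary positions get clipped spans.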
Disfluency Detection for Spoken Learner English
One of the challenges for computer aided language learning (CALL) is providing high quality feedback to learners. An obstacle to improving feedback is the lack of labelled training data for tasks such as spoken "grammatical" error detection and correction, both of which provide important features that can be used in downstream feedback systems. One approach to addressing this lack of data is to convert the output of an automatic speech recognition (ASR) system into a form that is closer to text data, for which there is significantly more labelled data available. Disfluency detection, locating regions of the speech where, for example, false starts and repetitions occur, and subsequent removal of the associated words, helps to make speech transcriptions more text-like. Additionally, ASR systems do not usually generate sentence-like units; the output is simply a sequence of words associated with the particular speech segmentation used for coding. This motivates the need for automated systems for sentence segmentation. By combining these approaches, advanced text processing techniques should perform significantly better on the output from spoken language processing systems. Unfortunately, there is not enough labelled data available to train these systems on spoken learner English. In this work, disfluency detection and "sentence" segmentation systems trained on data from native speakers are applied to spoken grammatical error detection and correction tasks for learners of English. Performance gains using these approaches are shown on a free speaking test.
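Once tokens have been tagged, the removal step itself is simple. A toy sketch, which assumes the per-token flags come from an upstream disfluency detector:

```python
def remove_disfluencies(tokens, disfluency_flags):
    """Drop tokens tagged as disfluent (false starts, repetitions,
    filled pauses) so the transcription reads more like written text,
    on which downstream GED systems were trained."""
    return [t for t, flag in zip(tokens, disfluency_flags) if not flag]
```

For instance, "i i want uh to go" with the repeated "i" and the filler "uh" flagged becomes "i want to go".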
Impact of ASR performance on spoken grammatical error detection
Computer assisted language learning (CALL) systems aid learners to monitor their progress by providing scoring and feedback on language assessment tasks. Free speaking tests allow assessment of what a learner has said, as well as how they said it. For these tasks, Automatic Speech Recognition (ASR) is required to generate transcriptions of a candidate's responses, the quality of these transcriptions is crucial to provide reliable feedback in downstream processes. This paper considers the impact of ASR performance on Grammatical Error Detection (GED) for free speaking tasks, as an example of providing feedback on a learner's use of English. The performance of an advanced deep-learning based GED system, initially trained on written corpora, is used to evaluate the influence of ASR errors. One consequence of these errors is that grammatical errors can result from incorrect transcriptions as well as learner errors, this may yield confusing feedback. To mitigate the effect of these errors, and reduce erroneous feedback, ASR confidence scores are incorporated into the GED system. By additionally adapting the written text GED system to the speech domain, using ASR transcriptions, significant gains in performance can be achieved. Analysis of the GED performance for different grammatical error types and across grade is also presented